#packages here
require(tidyverse)
require(scales)
require(adegenet)
require(DiagrammeR)
require(Biostrings)
require(cowplot)
require(pegas)
require(GenomicRanges)
require(GenomicFeatures)
In this R notebook we map the O. mykiss GTseq panel used by the SFGL to the O. mykiss genome and examine LD patterns among the markers.
Skip to Results Summary for a visual abstract of results
This is an rstudio project. If you’d like to pre-rendered figures, read a summary of analysis and view code, please open the html file in a browser.
To conduct the analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio. All data and analyses are available in the github repository at https://github.com/david-dayan/Omy_GTseq_markers.git
Samples
Samples are 42 Rogue River adult winter run steelhead collected 2019 at Cole Rivers Hatchery
Genome
We use the ENSEMBL release 100 genome for mapping (GCA_002163495.1), however, we also recommend mapping amplicons to a draft release of the new assembly (GCA_013265735.1), or the final release of the new assembly once available.
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab3
tab2 -> tab3
tab3 -> tab4
tab3 -> tab5
}
[1]: 'Get Amplicon Sequences'
[2]: 'Probe+Primer Sequence'
[3]: 'Mapping'
[4]: 'Compare to Previous Mapping Results'
[5]: 'LD'
")
The basic approach is to gather amplicon sequences for using similar filtering criteria to GTseq pipeline, then map these sequences.
Several approaches were used, final results are presented from Final Approach
This approach maps achieves several goals:
In approaches 1 and 2 below, generating a consensus sequence for the amplicon before mapping will obfiscate any paralogous sequences that will be pulled by the gtseq pipeline, because reads from paralagous loci will be collapsed into a single marker’s inferred amplicon sequence. By aligning the full set of reads against the genome, we can check if any markers are multi-mapping before building consensus sequences and mapping.
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
# edge definitions with the node IDs
tab1 -> tab2 [label = 'concat']
tab2 -> tab3 [label = 'split into marker-level files by primer+probe sequence']
tab3 -> tab4 [label = 'align (bwa-mem)']
tab4 -> tab5 [label = 'merge bams']
tab5 -> tab6 [label = 'samtools: mpileup + consensus']
tab5 -> tab7
tab4 -> tab8 [label = 'marker level mapQ and genome positions']
}
[1]: 'demuxed fastq'
[2]: 'merged fastq'
[3]: 'marker-level fastqs'
[4]: 'marker-level bams'
[5]: 'merged bams'
[6]: 'amplicon consensus sequence'
[7]: 'vcf file'
[8]: 'mapping statistics/paralog ID'
")
First step is to generate by-marker fastq files from all samples in the dataset
#first step is to concatenate some fish from a single population in to a large fastq
# will use the Rogue adult winter fish (CORH and APPR adults at /dfs/FW_HMSC/Omalley_Lab/bohns/GTseq/OmyRogue/baseline/sample_fastqs/)
#from the fastqs directory
touch merged.fq
cat /dfs/FW_HMSC/Omalley_Lab/bohns/GTseq/OmyRogue/baseline/sample_fastqs/*CORH* >> merged.fq
cat /dfs/FW_HMSC/Omalley_Lab/bohns/GTseq/OmyRogue/baseline/sample_fastqs/*APPR* >> merged.fq
# get panel marker information
panel <- read_tsv("GTseq_panel_metadata/probe_seqs.csv", col_names = c("marker", "primer","probe1", "probe2"))
#sanitize probe sequences
#note that some (15) of the probe sequences are identical and degenerate ie AT[AC]GC for both sequences, lets sanitize this dataset so all will be in the same format
#given the low numbers just did it manualy in text editor (i.e. converted one to the first value and the second to the second)
#REVERSE COMPLEMENTS
panel$probe1rc <- sapply(panel$probe1, function(x) as.character(reverseComplement(DNAString(x))))
panel$probe2rc <- sapply(panel$probe2, function(x) as.character(reverseComplement(DNAString(x))))
panel2 <- panel %>%
unite(probes, c(probe1, probe1rc, probe2, probe2rc), sep = "|" )
panel2$primer <- paste("^", panel2$primer, sep="")
#write_tsv(panel2, "./GTseq_panel_metadata/gtseq_panel.txt", col_names = FALSE)
#next we'll filter the merged fastq into files for each marker based on the presence of the primer and probe sequence using seqkit for speed
#can also consider using bbduk and kmer matching to speed up if this is too slow
#set an SGE array with a task dedicated to each line of the probes_seqs.csv file
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-10
#$ -tc 2
#$ -N marker_split
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
SEQKIT="/dfs/FW_HMSC/Omalley_Lab/dayan/software/seqkit"
MARKER_NAME=$(awk -v "line=$SGE_TASK_ID" 'NR == line {print $1}' ../gtseq_panel.txt)
PRIMER=$(awk -v "line=$SGE_TASK_ID" 'NR == line {print $2}' ../gtseq_panel.txt)
PROBE=$(awk -v "line=$SGE_TASK_ID" 'NR == line {print $3}' ../gtseq_panel.txt)
cat merged.fq | $SEQKIT grep -s -i -r -p $PRIMER | $SEQKIT grep -s -i -r -p $PROBE > $MARKER_NAME.fq
Then we align to the genome with bwa-mem
#next step is to align to the genome
#first create bwa index
SGE_Batch -c "bwa index Oncorhynchus_mykiss.Omyk_1.0.dna.toplevel.fa.gz" -r bwa_index
# then align each of the 390 files in the fastq directory using the script below and call with:
# qsub -q harold -c "scriptname"
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-390
#$ -tc 10
#$ -N align
#$ -cwd
#Get demuxed files
FILES=($(ls -1 ../fastqs/*fq))
#set up the loop
FILENAME=${FILES[$SGE_TASK_ID-1]}
GENOME="/dfs/FW_HMSC/Omalley_Lab/dayan/genomes/Omykiss/Oncorhynchus_mykiss.Omyk_1.0.dna.toplevel.fa.gz"
SAMTOOLS="/local/cluster/bin/samtools"
BWA="/local/cluster/bin/bwa"
$BWA mem $GENOME $FILENAME | $SAMTOOLS view -bSu - | $SAMTOOLS sort - -o ./${FILENAME%.fq}.bam &> ${FILENAME%.fq}bwa_mem.oe
Next we analyze mapping statistics for each marker
#first check flagstat, then pull SAM records and check mapQ and if any multi-aligning reads were found
#for each marker file collect:
#number of reads, properly aligning, secondary alignments, suppl alignments, average mapQ and the presence of any multi-mapping reads using flagstat
touch bam_flagstats.txt
for i in *.bam; do
#SITE=$($i)
FLAG=$(samtools flagstat $i)
READS=$(echo "$FLAG" | awk 'BEGIN { FS = " " } NR==1 {print $1}')
PRIMARY=$(echo "$FLAG" | awk 'BEGIN { FS = " " } NR==5 {print $1}')
SECONDARY=$(echo "$FLAG" | awk 'BEGIN { FS = " " } NR==2 {print $1}')
SUPPL=$(echo "$FLAG" | awk 'BEGIN { FS = " " } NR==3 {print $1}')
MAPQ=$(samtools view $i | awk -F "\t" '{sum+=$5} END {print sum/NR}')
SECOND_ALIGN=$(samtools view $i | grep -e 'XA:Z:' | wc -l)
paste <(printf "%s" "$i") <(printf "%s" "$READS") <(printf "%s" "$PRIMARY") <(printf "%s" "$SECONDARY") <(printf "%s" "$SUPPL") <(printf "%s" "$MAPQ") <(printf "%s" "$SECOND_ALIGN") --delimiters '\t' >> bam_flagstats.txt
done
Lets examine the distribution of mapping results:
map_results <- read_tsv("mapping_results/bam_flagstats.txt", col_names = c("marker", "reads", "primary", "secondary", "suppl", "mapq", "second_align"))
b <- ggplot(data=map_results)+geom_histogram(aes(x=primary))+theme_classic()
d <- ggplot(data=map_results)+geom_histogram(aes(x=suppl))+theme_classic()
e <- ggplot(data=map_results)+geom_histogram(aes(x=mapq))+theme_classic()
f <- ggplot(data=map_results)+geom_histogram(aes(x=second_align/primary))+theme_classic()+xlab("proportion multimapping reads")
h <- ggplot(data=map_results)+geom_point(aes(x=(second_align/primary), y = mapq))+geom_smooth(aes(x=(second_align/primary), y = mapq), method = "lm")
i <- ggplot(data=map_results)+geom_point(aes(x=(primary), y = mapq))+geom_smooth(aes(x=(primary), y = mapq), method = "lm")
j <- ggplot(data=map_results)+geom_point(aes(x=(second_align/primary), y = primary))+geom_smooth(aes(x=(second_align/primary), y = primary), method = "lm")
plot_grid(b,d,e,f,h,i,j, ncol =2)
#note "secondary" column from flagstats always = 0 and reads always = alignments, so skipped
multimap_50 <- sum((map_results$second_align/map_results$primary) > 0.5, na.rm = TRUE)
bad_map <- sum(map_results$mapq<3, na.rm = TRUE)
bad_mapXmultimap <- sum(map_results$mapq<3 | (map_results$second_align/map_results$primary) > 0.5, na.rm = TRUE)
bad_markers <- map_results[((map_results$second_align/map_results$primary) > 0.5), 1]
Hmmm, lets check if the “multimapping” reads are reliably multimapping. Is the secondary alignment relatively good compared to the primary alignment?
#This collects the average alignment score of (the secondary alignment divided by the primary alignment)
#!/bin/bash
#$ -S /bin/bash
#$ -N bad_markers
#$ -cwd
BAD_MARKERS="Chr28_11667578.bam
OMS00096.bam
OMS00149.bam
Omy_110689-148.bam
Omy_116938-264.bam
Omy_128693-455.bam
Omy_anp-17.bam
Omy_BAMBI2.312.bam
Omy_carban1-264.bam
Omy_cd28-130.bam
Omy_crb-106.bam
Omy_inos-97.bam
Omy_myclarp404-111.bam
Omy_nips-299.bam
Omy_RAD13034-67.bam
Omy_RAD1751-18.bam
Omy_RAD19578-59.bam
Omy_RAD30243-74.bam
Omy_RAD32139-58.bam
Omy_RAD36848-7.bam
Omy_RAD39156-33.bam
Omy_RAD43117-55.bam
Omy_RAD55997-10.bam
Omy_RAD619-59.bam
Omy_RAD66402-36.bam
Omy_RAD78147-27.bam
Omy_RAD86706-72.bam
Omy_stat3-273.bam"
touch marker_mapping_ratios.txt
for i in $BAD_MARKERS
do
samtools view $i | grep "XA:" | awk '{print $14,$15}' | sed 's/ /:/g' | awk 'BEGIN {FS=":"} {sum +=$6/$3} END {print sum/NR}' >> marker_mapping_ratios.txt
done
map_ratio <- read_tsv("mapping_results/marker_mapping_ratios.txt")
ggplot(data=map_ratio)+geom_histogram(aes(x=Aln_Score_Ratio))+theme_classic()+xlim(c(0,1))
Yes the secondary alignments are nearly as good as the primary alignments.
Marker Mapping Stats Results summary:
(1) left skew with long right tail on number of primary reads, suggesting there may be some markers drawn from multiple genomic regions (i.e. much higher coverage than average suggests paralogs), however, important to remember that variance in GTseq primer characteristics might lead to high variance in amplification rates so this might not be the best paralog filter compared to other types of NGS data
(2) Most markers have perfect or near perfect mapQ and no secondary (multiply aligning) or supplemental aligning reads, however some markers have high proportion of multiply alignments (28 with >50%), and low mapQ (16 with mapQ less than 3)
(3) Poor mapq and proportions of multiply aligning reads are strongly correlated, number of reads and poor mapq as well as proportion of secondary aignments and number of reads are NOT strongly correlated highlighting that covreage may be a poor way to check for paralogues
Do poorly mapping markers also demonstrate an excess of heterozygotes?
Big caveat here: allele correction values may have been used to skew results away from excess homozygotes at paralogs, so not exactly a robust way to identify them…
Using a previous GTseq genotyping run (Rogue River baseline 2019) we compare observed vs expected heterozygosity. This dataset only genotyped 330 markers (others removed during filtering), so some markers must be analyzed later.
load("2019_Rogue_gtseq_baseline/genind.R")
#separate pops to avoid Wahlund effect
n.pop <- seppop(genind)
#collect observed vs expected het for winter fish
hobs <- summary(n.pop$Winter)$Hobs
hexp <- summary(n.pop$Winter)$Hexp
het <- as.data.frame(cbind(hexp, hobs))
HWE <- hw.test(n.pop$Winter)
HWE <- as.data.frame(HWE) %>%
mutate(marker = row.names(HWE))
#join these results with mapping results and HWE
map_results$marker <- gsub("*.bam","",map_results$marker)
het <- het %>%
mutate(marker = row.names(het)) %>%
left_join(map_results)
het <- het %>%
mutate(multimap = if_else(second_align/primary > 0.5, 'multimap', 'single_map'))
het <- het %>%
left_join(HWE) %>%
mutate(sig_HWE= if_else(`Pr(chi^2 >)` < 0.05, 'outHWE', 'HWE'))
sum(het$sig_HWE=="outHWE" && het$multimap=="multimap")
## [1] 0
ggplot(data=het)+geom_point(aes( hexp, hobs, color= multimap))+geom_abline(slope=1, intercept = 0)+theme_classic()+scale_color_viridis_d()
No markers that demonstrate significant departure from HWE (p < 0.05) are also multimapping. However only 21 of the 28 multimapping markers overlap with this GTseq run. May want to run GTseq ourselves and check again.
Here we generate amplicon consensus sequences that can be used for future mapping projects.
merge bams -> variant call and index -> consensus sequence
individual bams -> bed for each marker with consensus sequence -> consensus amplicon
#first merge all bams to a single file
samtools merge all.bam *.bam
#variant calling script (need to call variants to generate consensus sequence in samtools framework) (make sure to set bash as shell in the command)
#$ -S /bin/bash
#$ -N consensus sequence
#$ -cwd
#$ -e error.txt
GENOME="/dfs/FW_HMSC/Omalley_Lab/dayan/genomes/Omykiss/Oncorhynchus_mykiss.Omyk_1.0.dna.toplevel.fa.gz"
BCFTOOLS="/local/cluster/bin/bcftools"
$BCFTOOLS mpileup -Ou --max-depth 10000-f $GENOME all.bam | $BCFTOOLS call -mv -Ob -o calls.bcf
#index
$BCFTOOLS index calls.bcf
#generate consensus
cat $GENOME | $BCFTOOLS consensus calls.bcf > consensus.fa
# this script takes individual bam files and makes a bed file for the
# #!/bin/bash
#$ -S /bin/bash
#$ -t 1-390
#$ -tc 20
#$ -N consensus
#$ -cwd
#$ -e error.txt
#set some shortcut variables
GENOME="/dfs/FW_HMSC/Omalley_Lab/dayan/genomes/Omykiss/Oncorhynchus_mykiss.Omyk_1.0.dna.toplevel.fa.gz"
BCFTOOLS="/local/cluster/bin/bcftools"
BEDTOOLS="/local/cluster/bedtools2/bin/bedtools"
SEQKIT="/dfs/FW_HMSC/Omalley_Lab/dayan/software/seqkit"
#Get demuxed files
FILES=($(ls -1 ../alignments/*bam))
#set up the loop
FILENAME=${FILES[$SGE_TASK_ID-1]}
#just bam records of correct length
$BEDTOOLS bamtobed -i $FILENAME > ${FILENAME%.bam}.bed
$BEDTOOLS getfasta -s -fi consensus.fa -bed ${FILENAME%.bam}.bed -fo ${FILENAME%.bam}.fa
AMPLICON=$($SEQKIT seq -M 102 -w 0 ${FILENAME%.bam}.fa | perl -e 'while (<>) {$h=$_; $s=<>; $seqs{$h}=$s;} foreach $header (reverse sort {length($seqs{$a}) <=> length($seqs{$b})} keys %seqs) {print $header.$seqs{$header}}' | head -2 | tr "\n" "\t") # this collects the consensus sequence for the interval spanning the longest read that aligned to the genome so long as that interval is shorter than the read length (101bp)
paste <(printf "%s" "$FILENAME") <(printf "%s" "$AMPLICON") --delimiters '\t' >> amplicons.txt
#then cleaned up this file with regex and saved as mapping_results/amplicons.txt
amps <- read_tsv("mapping_results/amplicons.txt")
amps$start <- as.numeric(amps$start)
amps$end <- as.numeric(amps$end)
Are any markers overlapping?
rangs <- makeGRangesFromDataFrame(amps, seqnames.field = "chr", start.field = "start", end.field = "end", strand.field = "strand", keep.extra.columns = TRUE)
rangs[countOverlaps(rangs)>1]$marker
## [1] "Omy_RAD15709-53" "Omy_RAD47080-54"
Omy_RAD15709-53 Omy_RAD47080-54 overlap, let’s look more closely at them.
amps[amps$marker==c("Omy_RAD15709-53" ,"Omy_RAD47080-54"),]
They cover the exact same regions of the genome, do they also share the same probe sequence?
Yes, same probe sequences capture the same SNP, just different intervals around it. Aligned sequence here:
ATGCAAGGCTTAAA| TTTAAGCCTTGCAT|ATGCAAGACTTAAA| TTTAAGTCTTGCAT
TGCAAGACTTAAAACGA|TCGTTTTAAGTCTTGCA| TGCAAGGCTTAAAACGA|TCGTTTTAAGCCTTGCA
What about longer distances?
rangs[countOverlaps(rangs, maxgap=1000)>1]$marker
## [1] "Omy_arp-630" "Omy_G3PD_2-371" "Omy_G3PD_2.246"
## [4] "Omy_LDHB-2_e5" "Omy_LDHB-2_i6" "Omy_mapK3-103"
## [7] "Omy_myclarp404-111" "Omy_Omyclmk438-96" "Omy_RAD15709-53"
## [10] "Omy_RAD47080-54"
overlaps_1k <- findOverlaps(rangs, maxgap=1000, drop.self=TRUE, drop.redundant=TRUE)
cbind(c(rangs[overlaps_1k@from]$marker), rangs[overlaps_1k@to]$marker)
## [,1] [,2]
## [1,] "Omy_arp-630" "Omy_myclarp404-111"
## [2,] "Omy_G3PD_2-371" "Omy_G3PD_2.246"
## [3,] "Omy_LDHB-2_e5" "Omy_LDHB-2_i6"
## [4,] "Omy_mapK3-103" "Omy_Omyclmk438-96"
## [5,] "Omy_RAD15709-53" "Omy_RAD47080-54"
kb_markers <- unique(unlist(as.data.frame(cbind(c(rangs[overlaps_1k@from]$marker), rangs[overlaps_1k@to]$marker))))
length(kb_markers)
## [1] 10
arrange(amps[amps$marker %in% (kb_markers),], chr)
overlaps_10k <- findOverlaps(rangs, maxgap=10000, drop.self=TRUE, drop.redundant=TRUE)
cbind(c(rangs[overlaps_10k@from]$marker), rangs[overlaps_10k@to]$marker)
## [,1] [,2]
## [1,] "Chr28_11671116" "Omy_RAD15709-53"
## [2,] "Chr28_11671116" "Omy_RAD47080-54"
## [3,] "Chr28_11671116" "Chr28_11676622"
## [4,] "Chr28_11676622" "Omy_RAD15709-53"
## [5,] "Chr28_11676622" "Omy_RAD47080-54"
## [6,] "Chr28_11676622" "Chr28_11683204"
## [7,] "Chr28_11607954" "Omy_GREB1_05"
## [8,] "Chr28_11632591" "Chr28_11625241"
## [9,] "Chr28_11658853" "Omy_RAD15709-53"
## [10,] "Chr28_11658853" "Omy_RAD47080-54"
## [11,] "Chr28_11625241" "Omy_GREB1_05"
## [12,] "OMS00062" "Omy_117370-400"
## [13,] "Omy_arp-630" "Omy_myclarp404-111"
## [14,] "Omy_crb-106" "Omy_CRBF1-1"
## [15,] "Omy_G3PD_2-371" "Omy_G3PD_2.246"
## [16,] "Omy_LDHB-2_e5" "Omy_LDHB-2_i6"
## [17,] "Omy_mapK3-103" "Omy_Omyclmk438-96"
## [18,] "Omy_RAD15709-53" "Omy_RAD47080-54"
## [19,] "Omy_RAD46672-27" "Omy_RAD58213-70"
kb10_markers <- unique(unlist(as.data.frame(cbind(c(rangs[overlaps_10k@from]$marker), rangs[overlaps_10k@to]$marker))))
length(kb10_markers)
## [1] 24
kb10 <- amps[amps$marker %in% (kb10_markers),]
arrange(kb10, chr, start)
There are also 5 pairs of markers within 1kb of one another (10 markers), and 19 pairs within 10kb (24 markers).
For the 1kb pairs, pairs occur on 4 chromosomes, (2,4,7,28) over 5 regions (there are two regions on chromosome 7 about 47Mb apart).
For the 10kb pairs, pairs occur on 7 chromosomes (2,4,7,16,17,28,29) over 8 regions (two regions on chromosome 7 about 47Mb apart)
Plots
chrom_sizes <- read_tsv("GTseq_panel_metadata/chr_sizes.txt")
amps$chromosome <- amps$chr
colnames(chrom_sizes) <- c("chromosome", "size")
# create an ordered factor level to use for the chromosomes in all the datasets
chrom_order <- c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14",
"15", "16", "17", "18", "19", "20", "21",
"22", "23", "24", "25", "26", "27", "28", "29", "MSJN01077273.1")
chrom_key <- setNames(object = as.character(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30)),
nm = chrom_order)
chrom_order <- factor(x = chrom_order, levels = rev(chrom_order))
# convert the chromosome column in each dataset to the ordered factor
chrom_sizes[["chromosome"]] <- factor(x = chrom_sizes[["chromosome"]],
levels = chrom_order)
amps[["chromosome"]] <- factor(x = amps[["chromosome"]],
levels = chrom_order)
allmarkers <- ggplot(data = chrom_sizes) +
# base rectangles for the chroms, with numeric value for each chrom on the x-axis
geom_rect(aes(xmin = as.numeric(chromosome) - 0.2,
xmax = as.numeric(chromosome) + 0.2,
ymax = size, ymin = 0),
colour="black", fill = "white") +
# rotate the plot 90 degrees
coord_flip() +
# black & white color theme
theme(axis.text.x = element_text(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank()) +
geom_rect(data = amps, aes(xmin = as.numeric(chromosome) - 0.2,
xmax = as.numeric(chromosome) + 0.2,
ymax = end+50000, ymin = start-50000), fill="red")+
# supress scientific notation on the y-axis
scale_y_continuous(label=comma) +
scale_x_discrete(name = "chromosome", limits = names(chrom_key)) +
ylab("region (bp)")
What about the dense regions? Can we make a plot with genes and take a look at what we’re tagging.
require(karyoploteR)
## Loading required package: karyoploteR
## Warning: package 'karyoploteR' was built under R version 3.6.2
## Loading required package: regioneR
## Warning: package 'regioneR' was built under R version 3.6.2
chrom_sizes$start <- rep(1, 30)
Omy.genome <- toGRanges(data.frame(chr=chrom_sizes$chromosome, start=rep(1, 30), end=chrom_sizes$size))
Let’s Look at a handful of individual chromosomes
kp <- plotKaryotype(genome = Omy.genome, chromosomes = "28")
kpAddBaseNumbers(kp)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker)
kp <- plotKaryotype(genome = Omy.genome, chromosomes = "7")
kpAddBaseNumbers(kp)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker)
kp <- plotKaryotype(genome = Omy.genome, chromosomes = "7")
kpAddBaseNumbers(kp)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker)
#note this will not work for cloned repositories because thegff file is not in the repo, must download the gff yourself
db28 <- makeTxDbFromGFF("2019_Rogue_gtseq_baseline/Oncorhynchus_mykiss.Omyk_1.0.100.primary_assembly.28.gff3.gz")
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
greb1_region <- toGRanges(data.frame("28", 11580000, 11800000))
pp <- getDefaultPlotParams(plot.type=2)
pp$bottommargin <- 1
pp$data1height <- 400
kp <- plotKaryotype(plot.type =2, genome = Omy.genome, zoom=greb1_region, plot.params = pp)
kpPlotGenes(kp, data = db28, r0=0, r1=0.2, add.gene.names = TRUE, add.transcript.names = FALSE, add.strand.marks = FALSE, data.panel = 2, plot.transcripts = FALSE, gene.name.position = "bottom", gene.col = "#FDE725FF", clipping = FALSE)
kpAddBaseNumbers(kp, tick.dist = 100000)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker, data.panel = 1, marker.parts = c(0.1,0.1,0.1), clipping = FALSE, ymax=3)
greb1_region <- toGRanges(data.frame("28", 11580000, 11800000))
pp <- getDefaultPlotParams(plot.type=2)
pp$bottommargin <- 1
pp$data1height <- 400
kp <- plotKaryotype(plot.type =2, genome = Omy.genome, zoom=greb1_region, plot.params = pp)
kpPlotGenes(kp, data = db28, r0=0, r1=0.2, add.gene.names = TRUE, add.transcript.names = FALSE, add.strand.marks = FALSE, data.panel = 2, plot.transcripts = FALSE, gene.name.position = "bottom", gene.col = "#FDE725FF", clipping = FALSE)
kpAddBaseNumbers(kp, tick.dist = 100000)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker, data.panel = 1, marker.parts = c(0.1,0.1,0.1), clipping = FALSE, ymax=3)
First let’s get al the SFGL results into one place
map_results$marker=gsub("*.bam","",map_results$marker)
lab_results <- read_tsv("GTseq_panel_metadata/gtseq_panel_info_SFGL.txt")
master <- lab_results %>%
dplyr::rename(marker=Assay) %>%
full_join(amps) %>%
full_join(panel2) %>%
full_join(map_results)
#a set of markers is named differently despite being the same marker across these spreadsheets, markers in most doc refer to these as Omy28_postion, lets rename them to match
to_rename <- master$marker[c(392:401)]
new_name <- gsub("Chr", "Omy",master$marker[c(392:401)])
lapply(1:10,FUN = function(i){amps[amps == to_rename[i]] <<- new_name[i]})
## [[1]]
## [1] "Omy28_11671116"
##
## [[2]]
## [1] "Omy28_11676622"
##
## [[3]]
## [1] "Omy28_11607954"
##
## [[4]]
## [1] "Omy28_11683204"
##
## [[5]]
## [1] "Omy28_11702210"
##
## [[6]]
## [1] "Omy28_11632591"
##
## [[7]]
## [1] "Omy28_11658853"
##
## [[8]]
## [1] "Omy28_11773194"
##
## [[9]]
## [1] "Omy28_11625241"
##
## [[10]]
## [1] "Omy28_11667578"
lapply(1:10,FUN = function(i){panel2[panel2 == to_rename[i]] <<- new_name[i]})
## [[1]]
## [1] "Omy28_11671116"
##
## [[2]]
## [1] "Omy28_11676622"
##
## [[3]]
## [1] "Omy28_11607954"
##
## [[4]]
## [1] "Omy28_11683204"
##
## [[5]]
## [1] "Omy28_11702210"
##
## [[6]]
## [1] "Omy28_11632591"
##
## [[7]]
## [1] "Omy28_11658853"
##
## [[8]]
## [1] "Omy28_11773194"
##
## [[9]]
## [1] "Omy28_11625241"
##
## [[10]]
## [1] "Omy28_11667578"
lapply(1:10,FUN = function(i){map_results[map_results == to_rename[i]] <<- new_name[i]})
## [[1]]
## [1] "Omy28_11671116"
##
## [[2]]
## [1] "Omy28_11676622"
##
## [[3]]
## [1] "Omy28_11607954"
##
## [[4]]
## [1] "Omy28_11683204"
##
## [[5]]
## [1] "Omy28_11702210"
##
## [[6]]
## [1] "Omy28_11632591"
##
## [[7]]
## [1] "Omy28_11658853"
##
## [[8]]
## [1] "Omy28_11773194"
##
## [[9]]
## [1] "Omy28_11625241"
##
## [[10]]
## [1] "Omy28_11667578"
master <- lab_results %>%
dplyr::rename(marker=Assay) %>%
full_join(amps) %>%
full_join(panel2) %>%
full_join(map_results)
#now merge with CRITFC_mapping results
#first had to clean up the marker column in this spreadsheet, markers with multiple secondary alignments were split into separate rows and had marker coulmn renamed to reflect, I collapsed marker names back to original and kept only best hit, and marked any reads with seoncdary alignments with new column
CRITFC <- read_tsv("shared_info/CRITFC_map_results.txt")
master2 <- master %>%
full_join(CRITFC)
#write_tsv(master2, "./final_mapping_results.txt")
#count how many markers were successfully aligned in both
sum(!rowSums(is.na(master2[,c("chr", "CRITFC_chromosome")])))
## [1] 352
sum(as.numeric(master2$chr)==as.numeric(gsub("omy","",master2$CRITFC_chromosome)), na.rm = TRUE)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## [1] 335
ggplot(master2)+geom_jitter(aes(mapq, CRITFC_mapQ), alpha = 0.5)+theme_classic()+geom_smooth(aes(mapq, CRITFC_mapQ), method= "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 48 rows containing non-finite values (stat_smooth).
## Warning: Removed 48 rows containing missing values (geom_point).
ggplot(master2)+geom_point(aes(second_align/primary, CRITFC_secondary_maps), alpha = 0.5)+theme_classic()+geom_smooth(aes(second_align/primary, CRITFC_secondary_maps), method= "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).
diffs <- (as.numeric(master2$chr))!=(as.numeric(gsub("omy","",master2$CRITFC_chromosome)))
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
disagree <- master2[diffs,]
disagree <- disagree[rowSums(is.na(disagree)) != ncol(disagree), ]
master2$chr_dis <- (as.numeric(master2$chr))!=(as.numeric(gsub("omy","",master2$CRITFC_chromosome)))
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
master2$proportion_second <- master2$second_align/master2$reads
summary(aov(data=master2, chr_dis ~ proportion_second))
## Df Sum Sq Mean Sq F value Pr(>F)
## proportion_second 1 0.925 0.9255 23.98 1.49e-06 ***
## Residuals 348 13.432 0.0386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 51 observations deleted due to missingness
Of the 352 markers that were successfully mapped both here and by the CRITFC lab, 335 aligned to the same locus on the reference genome. Mapping quaility was highly correlated between the two mapping analyses, markers identified as potentially multi mapping were also in good agreement. Markers with disagreements in mapping position were enriched for markers with a high number of seconday alignments.
In previous mapping project for Chinook the following approach was taken:
#!/bin/bash
seqs=(
GCCTAGGTATGTACGAAACTTCACA
CCACTGGCTGTGGAGCTT
GCAGGGAAAACTGGTCAGGA
GGTCACGATCGTGGGGTC
CGCAATGAGCCAACCCCT
TGCAGGCATCATGCTTAATAACT
CGGAAGACCAGATTCTCCAAGAGTA
TTTTCTAGGACAGGTTGCTTGCA
TGAAAGATATCAATTGTAGTAGTGGTGGTG
ACTTCTTGAGCCAATCGGATGATG
GCGACTTGACAACGAGGAGAA
CATTCCATGACAATGATTGAAATCTAAAAACAC
GCCTACTGATAAATGTATGACAGTAATGGA
CCAGCAGAGACTGGGTTCAC
TGGGACAGAGGTGGGAATTGA
CTCTGCCATTCATTTGGGCTTTG
ACCACCCACCTCCTCAGA
CAAACGCGCACTCACACACA
CAGCTGCTGCAGTCAATGAG
GCGTTACTGGTGTTATAAACGTTAGC
CCTGAGCATCCCAGTTGAACT
AGTACAAGTGCAGAGAATGACATCATG
CGATGTACTGAGGGCAGTGT
GACTGTCTTGGAACCGTTGCTA
GAACTGAGCGGCTGCTG
TGTGTACATCCGCGTAAATATTGAAGATAA
CCTCAGTGTTATTTGTATATGATCATTTTGAAACATTT
TTGTTCAATGGGCATTAATGCATGTT
CAAGGGCACATTGGCAGATTTT
ACTCTATCATCGGCAGGACCAT
ATCGAGGATGCCTCAAAGACATC
CCCACTTCCAGAGCCTGAA
GCCCTTGTGACAATGCACTGTTATA
GTGATGAGAGGTTTCCGGAAAATCT
CTCCCTTGCTTTTGGTCATTGG
CAGGCTTGTGTTAAGTAGGGAGAAA
GAGGTTTGTTACTGTCACCCATAGA
CCTTTTTCTTATTAGTTTTACTTCCCCAGAGA
TGAAATAAATTGTTCTGTTGATATGTGAATTTTGGA
GCCAGATAGTAGCGTACATCATGAG
TCTCCCTCATTCCCATGTCATATCA
AACAAAGAATGTTAAACACCAAACAGGAA
GTTTGGCTATTGAAATTATACATTAAAACATGTAGCT
CTCTTGTTTGCTATGGGAGATGTAGT
GAGTGGTCAAGGTTTCAGTTTCTG
GTATAAACTAGAGTCCAGTGTTATGTTAATGTCTT
TTGTGGAATTACACCTTCAGAGTTCAAT
CCATGCGCCTTTGAGGAAATTAA
CTGAGCTTTTTCAACTTACTTGTTGGA
CTGCATGAACGTTAACTCAAATAAAAGGT
GCATGGCTGCCCTAGAACA
GTGGGTAATCGATGCCAAAGAGAT
CATAGATGTTTATATGAAAAACCTCCCACTGT
GCCTACAGCAAATTCAGCTACACAT
GAGGCCTAATGTCTCTTGTGACT
CCCAAGTGGTGAGTGTCAGT
GGAGGTGTAGTGAAATGGGAAGAT
GTGACAGGAGACAGAAAGAGACATT
ACACCCACTTCAACCTCCATAAC
TGCAAACACAGAGGAAAGGGATTT
TCATCAAAACATGCCTCTTCTGTGT
TGCGCGTCTCATTCAACCAT
CCATACAGCCAGTCCAGGTG
ATTTTCAAACAGGCATTTATCATTGGTGAA
CAGACAGGTCACCATCACACT
GCACCGTATCAACGAGCTCAT
CTCAACAGTGCACCTCCCTTAATT
TCGCTAGGCAGAAATATAGGGTTCT
AGTTGTTCTTTTTATATTGTGTTTTTATTCCATTCCA
GGATGGTTGTCATTTCTCTGCAAA
TGGAGAACTTGCACTGAATGTGAAA
CTGCTGGCGCAGACATG
GGTTGCAGGGCAGAACTGT
TCAATGTTCATCAATGCACTTCCTGTA
CGTGTCCGGCTTCTTTTATTTCATT
CTGTTAGTGCAGAAGACGTAGCT
AACCCTATGGGAACTCGTAGAACT
TGGGACCCACATAAAGCAACTG
GCATGTAACACATTATTTGGCATATGTACT
CGGTCATTGTAAATGTCAACGGTTT
CCTATTTTTGATAGGTCATAGTGAATGGGATAG
TGATTGTCTCATGGCCAATTGTCA
GGCTCGAACCACCCAGTTTA
GGCACTCTCCCTGGCTAGA
CCGTCTGAGTAGGAGGATCAATACA
GTCTCTCTCTCTTTGCATCATTACACT
GCTCTTGCCCATCTGTAGGAT
GATCATGTCAGATAGGATGCTGAAAGT
TCTCCTGAACTAATTTAGACCTCTGAATGT
CCTGAACAAATACTTAACGCTCCAGTT
TTCCCTAATCTGACGTACTACCAACT
CTGCGTTTCTGGAATGTTTTCAGT
GCAATGGCATGACAATGGAAGTC
TGACAGATTTCACCTTTAACTAGCTAAGC
CGTGGAGTAGGTGGTTACAGTTTAT
GCCATGGAGGACTGGATGA
GCGATCAGGTGACGCTAAAATGA
TTTGTGCGTAAAGTCAGGTAGTGT
GCCCTGGAGAAGTACGTTTTAAACTAA
CCGACGCCTCACTGAGT
CCTGAACAGGTACACACAAACGA
ACTGCCACAGACACGAACTC
ACTCTGGGTCCAGGAGGTTTT
CACTTGTTCTGCACACTACTTGTC
CTGCTCACCTGCATCAGTGT
TGTTTATCTCTGAGTGAAAAAGGTGTGT
TGCATGTTTTCTAACTGTGTTTTTGTGT
AGGTCCTCTGTCGCACCTA
GGACAATCGAAGACGAAAATATGCA
GCTGTGATTGTGCTCTAAAGACATG
CACTGAACTGTAAGCCATTGTGATT
CAACGCGGGAATGGCTTTTAA
CCGTTGCAGGACTCATCAGT
TGGGATAGAACAGGAGCTTAAACA
TGCAGGAACTTGCTATGCT
GCTGTTATAGCGCGGACTCA
TGCAGGAGGAGGAAGGCA
TGTGTATTCGTCGACCGGA
TGCCGCTGGATTTATTGACA
GCAGGGCGCAAAGTTCTT
GGCACAGCGACAGGAGTT
CCTCTGCTGAGTTTGAGGGG
GGGAGGCAGGCAAAAGGT
CTGCTTGTAGCCGTTCAGC
TGCAGGTGGGACTTAAACACA
ATCAGGTCTGGGGCGACA
GCCATATCCCGGGGCTTG
TGCAGGAGAGCAGGGTAGA
ACTGCAGGCGTCATGCTT
TGCAGGAGCTGTGATGGG
GGGAGAGGGAGACGTGGA
GGGCCACGGGGTTGTAAA
TGCAGGAGAAGCTGACTGAC
GGAGGCTCTACGTAGGCCT
CAGATGGTGCAGGCCGAA
GCAGGGGCAGACTGAAGG
CTTGCGCGGCAGTTGAAC
TGCAGGGTGGAAGAATTCATC
TGCAGGCTGACTTGGGTA
CGACCATGGCCCCCAATTCAT
CAGTTCGCTTCTCCAGGGA
TGCCTAAACACTCCCAAGGT
TGCAGGAAGAGTTCAGAGAAATCT
TGCAGGACCAACTTTCTCAT
CAAAGTGCAGGTGCTGGC
GGCGACACACACAGGGTT
GTGCCGCAACAGAGTGGA
TGACCCTCCAGCGTGACT
AAGCTATGCAGGCGACGG
CGCAAGTCAGCAGGGTGA
GCAGGGTCTGTGTGGGTT
CAGGAACCTGCTTTAATGCTCT
CTCCCTGTTCGCTAGCCG
AGATGCAGGAGGCTCTGGA
TGCAGGGTTGGGGACAATT
ACAGAGCTGTGTCTACCAGA
TGCAGGGGGACAAGAGAGA
TGCAGGGACGGGGCT
TGCCGTGAGAAACTGGTCA
CAGGCAGTCACTGAGTCCG
GCCAAGTGATCAAGTGCTTGT
ACTCTCCCAGAAGGATTCAGAGA
GCCATTTGACCAACGGAGC
GCAGGAAGCAAAGTTCGGTG
GCTGACCACCGACCACAG
ACACATGGCTCGTCTGCA
GCAGGGACAGGGCCCT
CCTGCTCTGTGTCTGGGC
AGTGCAGGTCTCCAGATTTACA
GAATGCAGGGCCAGGGAG
ATGACCAATTGAAGAGTTCTTCCGT
GGTGCCACTTTAGTATAGCTGCTTA
CCTTTGGGTCTGCTTGAGGTT
GCCCTGCCTGCAACTTC
GGTGATTTTGCCACAGAGTAGAGAT
GGACTCGTGCTTGAGGAAGATG
TCTGAACTCACCAAAGGAACACTTG
GTTCGTGGGATTGTTCAATGTTCAT
TCAAAAATGTCTATCCAACAAATACTCTGAAAAATATTG
GAGACAAAGGTTTGCAGGTTCATG
GTTCTTTTTAATGATGACTACAGGTCTTTCAC
CTTTTCTGAATTAGTGCTGTGCTTGT
GCGTACTGAGCCTGGATGACA
CAGATGAAAAATAAATAATTGGGCCATTAGGAA
CACTAAATATTCCTTATCATTTCATACTAAGTCTGAAGAA
GGTGATAACAGGTGTTGCACCAA
GGAGAACATGCATCACCATTCAAG
CAGCCCGTCCCAAAATCAAG
CACAGGAAGGACGTGTTTTGATG
CAAGAACACCGAGATCTCCTTCA
TGCTTCAGTGAAAATAAGCGTGAGA
TCTTTGATATTGAGCTCATAAAAGCAAGGT
TGCATCCATTCATACCTGACCAATT
TTGAGAACATGTGGTAATTAACTACAATGACTAA
TAGGAGTTGGAAAGACTGCACA
ACAGTATACCGGCTGCCTATTCATA
CACCTTAGTTCCACGCAACATG
GGTAGGCCGTCAGTGTAAAATAAGT
GAGGCTGACTTGGACTTTGC
CCTCCAGATGAGACCCACTCT
CGTGGTGTTCGCCTTCCT
GACTCAGGTAAGGAAACATCAATGTCA
CACCTGAACCTCCACTGTGT
CAATTACTCTTTCTCAGCCCTGTGT
CGTGACCCTTGTAACTGAAAAGC
TGTTGTCTCGGACTGCATGAC
GATCATTTATCAAGACTATAGGCTATGGATACG
GTCCACATTCTCCAGTACATGTATGG
GTCCTCAGCTGGGTCAAGAG
CAAGGGATGTGACAAATTAATCAAACACATAA
CCTTAGCTGCTCTTTGAAGTTGACT
CTCCCCCCTGGACTTTGG
GTGTGTGTGTGTGTGTGTCATCGT
TGCCACCTCAGTTTTAGTGTTATATCC
CTCACTGCAAATCCAACTTCATCAT
CCGTCCACAGCACAAGACTATAATA
CATTTAGCAGACACTCTTATCTTAGTGTCA
GTGCTGCAGGAACCATGTG
GGCCATCTTTCAGGACGTACAG
TGCAGTTACAAGCCTAAGACAATCT
CCAGCCCCGTAACACACAT
CGCTGGGCATGGATGAGT
GGTCTGTCTGTCTGTCTATCTGTCAATG
AAATGAGGCCGTCCTTTACACT
GCCGAAAAATAAGCGATTAGTGATGA
CGGACAAAGAGCTACAGAAATGC
ATGTCAATATATTTCACTATAATGATTGGAAGCCA
TGAGCGAGATTTATCAAACTGTCAAAGA
GGAACTTCCTCTCCCGTTCTG
AGTCAGTGTTGGTGTAGTGAAGAGA
AGAGCATTCAATTTAAAAGCTGAAAACGA
CTCATACTTTGTACCTGTGTGTTCCA
GCATTACTAAAAACTGGTGTGTGGAA
CTCTTGCTACTTGCAGTGTATCTCA
TGTTTTTGGTCATGTATTTTCTCTGCTATTTTT
CCTGGTCTGTTTGTGATCAAGATG
TCTTTGGACTGTGTATACCAGGTGTA
CATTTCCACGAAAAGCCAGATGAC
TCATAAACATGGTGTCTTTCAGTCAGTT
TTCTGGGTTGCCATACTCTTTCAAT
AAGGTCTACTCCGGTTGTATTCGGT
TGCCATCATAAACAACCTAACAAGTAACT
CCAAATACAGACCAGCTACTTGTGT
GTCGATTACCGTTAGCTTCATCCT
CTAAGTTCTTCCTGCCTAATGTGGAT
AATATTGGCTTTCTGAGAATGCATTTGG
CCATTCCCATCGGCATCGT
TGTGTTTAGGATTGAACTGACCATGTT
TCAAAGACATCGAACACAAGAACGA
TTTCTCATCCTTCTCTCTTCCAGTCT
ACCAGTACCTAAACGTTAGAAAGCAA
GCCTCACATTTTACTGATGTCACTTC
TTTTAAAAATGGAGATAAACTCCTGACCTGAA
TGCACCTGCGAGAGCAT
CCAAATCCTCATCCCACACACT
GCCAATACGGGTTCTGAACTGT
TGTTGTAATCTTTCTGAATATTTGCTTGCTT
GGACAAGTTGAAACAGATCAGGAAGT
CCTTCAAACTAACACATCATAGACATGCTT
GTCAACAAATGCAGGTAACATAAATGGT
CTCGCCTCTGTCATTGTATTACCTT
TTGCCCATTCACCATCGGAAT
GGAAACCAGCTAGGATTCAGGAA
GGATGTAGAGTGAAATCACCTTCGA
CGCGAGTTAGCTCGAATATTATGATTTC
AGACAATCATGGTGTTTTGAGTCTTTCT
GCTGAGGAAGGATTCTGTATTTGCT
AGCTAGGCTGTAAATGCAAGGAT
GGTTTGAGCCAATCAGTTGTGTT
GAGGATGACACTGTCCGTTTGT
CCGCCTTTCCCACCTTCTC
GGGACCACATAGAAAACATCTGTCT
GGCAGGTAAGTGGAACGTTTTAGAT
CGAAATAAGGGCCTGGTGTTTAAAA
GGATGACTCCTACTAATAGACGGATGT
CACTTTTGACTTTACATGGAACTTAACTCAT
TGGTGAGAGCAGCTTTAAATGTCTT
CCCCATATGAGACGCTACAGTAATG
CCAGGTCGTCTTTATTGCAGATTATCA
TCGTGGATTGTGGCTTACGT
AATGGGTAACAAAGAAATAGCTAGCTACTT
CTGGTCTGTGACGTCAAAATGATG
CATAGTATAGTGATTCGAGTCTGGAGTCT
ATCCAAGGAGCCCCATTAAAGATTT
CAATGTCTAAAGTAATGGTGGTATTCTTGC
CCAGAGGTTAGATGGCCCTTT
GAAAAAGTAAAGTAAAAGTAAAGTATTATACCACTAAAGACAAT
TAACCATGACTTCTATCAATCACCCC
GAAACGTCTATGCTGTCCCCTTTAA
TTTGAGTGAGTCACTGCACCAA
TCAAGACTGTGCTGTAGTTGTCTAC
ATGGGTTGGGATTATGGTTCATTGT
CAAATCAGAACAAAACCTCCCACAA
GGGCAATGGTGGCTATGCT
CTTTTCGGGTTATTCATGCTGTTGT
CTACGCGAGAAATAACACTTTTCAAAACT
TGCTGAGGACCATCTGCAATTC
GCCTACCAGAAAGTACCAATTGTGA
CGGTTCGTGTCCATTGCATTATTAT
CTCATCTCATTATCTACCTTAACGTGCA
)
loci=(
Ots_110495-380
Ots_ARNT
Ots_crRAD18289-33
Ots_crRAD48459-74
Ots_crRAD55400-59
Ots_crRAD57376-68
Ots_100884-287
Ots_101119-381
Ots_101554-407
Ots_101704-143
Ots_101770-82
Ots_102213-210
Ots_102414-395
Ots_102457-132
Ots_102801-308
Ots_102867-609
Ots_103041-52
Ots_103122-180
Ots_104048-194
Ots_104063-132
Ots_104415-88
Ots_105105-613
Ots_105132-200
Ots_105385-421
Ots_105401-325
Ots_105407-117
Ots_105897-124
Ots_106313-729
Ots_106419b-618
Ots_106499-70
Ots_106747-239
Ots_107074-284
Ots_107285-93
Ots_107607-315
Ots_107806-821
Ots_108007-208
Ots_108390-329
Ots_108735-302
Ots_108820-336
Ots_109525-816
Ots_109693-392
Ots_110064-383
Ots_110201-363
Ots_110381-164
Ots_110551-64
Ots_110689-218
Ots_111084b-619
Ots_111312-435
Ots_111681-657
Ots_112208-722
Ots_112301-43
Ots_112419-131
Ots_112820-284
Ots_112876-371
Ots_113242-216
Ots_113457-40R
Ots_115987-325
Ots_117242-136
Ots_117259-271
Ots_117370-471
Ots_117432-409
Ots_118175-479
Ots_118205-61
Ots_118938-325
Ots_120950-417
Ots_122414-56
Ots_123048-521
Ots_123921-111
Ots_124774-477
Ots_126619-400
Ots_127236-62
Ots_127760-569
Ots_128302-57
Ots_128693-461
Ots_128757-61R
Ots_129144-472
Ots_129170-683
Ots_129458-451
Ots_129870-55
Ots_130720-99
Ots_131460-584
Ots_131802-393
Ots_131906-141
Ots_94857-232R
Ots_94903-99R
Ots_95442b-204
Ots_96222-525
Ots_96500-180
Ots_96899-357R
Ots_97077-179R
Ots_97660-56
Ots_98409-850
Ots_98683-796
Ots_99550-204
Ots_afmid-196
Ots_AldB1-122
Ots_aldb-177M
Ots_AldoB4-183
Ots_arp-436
Ots_AsnRS-60
Ots_aspat-196
Ots_BMP2-SNP1
Ots_brp16-64
Ots_Cath_D141
Ots_CCR7
Ots_CD59-2
Ots_CD63
Ots_cgo24-22
Ots_Chin30up-211
Ots_CirpA
Ots_cox1-241
Ots_CRB211
Ots_crRAD10447-25
Ots_crRAD11620-55
Ots_crRAD12037-39
Ots_crRAD12711-37
Ots_crRAD13725-51
Ots_crRAD16540-50
Ots_crRAD17527-58
Ots_crRAD18492-65
Ots_crRAD18937-60
Ots_crRAD20262-46
Ots_crRAD20376-66
Ots_crRAD20887-70
Ots_crRAD21115-24
Ots_crRAD22960-32
Ots_crRAD23631-48
Ots_crRAD24807-74
Ots_crRAD25367-50
Ots_crRAD255-59
Ots_crRAD26081-28
Ots_crRAD26165-69
Ots_crRAD26541-47
Ots_crRAD27164-55
Ots_crRAD27515-69
Ots_crRAD2806-42
Ots_crRAD28677-65
Ots_crRAD292-21
Ots_crRAD30341-48
Ots_crRAD33054-62
Ots_crRAD33491-71
Ots_crRAD34397-33
Ots_crRAD35313-66
Ots_crRAD36072-29
Ots_crRAD36152-44
Ots_crRAD3758-51
Ots_crRAD38095-29
Ots_crRAD38746-36
Ots_crRAD42058-48
Ots_crRAD44588-67
Ots_crRAD46081-56
Ots_crRAD46751-42
Ots_crRAD47297-55
Ots_crRAD5061-27
Ots_crRAD55475-26
Ots_crRAD57520-66
Ots_crRAD57537-24
Ots_crRAD57687-34
Ots_crRAD60614-46
Ots_crRAD60620-51
Ots_crRAD61523-71
Ots_crRAD66330-60
Ots_crRAD69327-53
Ots_crRAD73823-60
Ots_crRAD74766-28
Ots_crRAD75581-70
Ots_crRAD76512-28
Ots_crRAD78968-46
Ots_crRAD92420-25
Ots_crRAD9615-69
Ots_DDX5-171
Ots_E2-275
Ots_EndoRB1-486
Ots_EP-529
Ots_Est1363
Ots_Est740
Ots_ETIF1A
Ots_FARSLA-220
Ots_FGF6A
Ots_FGF6B_1
Ots_GCSH
Ots_GDH-81x
Ots_GH2
Ots_GnRH-271
Ots_GPDH-338
Ots_GPH-318
Ots_GST-207
Ots_GST-375
Ots_GTH2B-550
Ots_HFABP-34
Ots_HMGB1-73
Ots_hnRNPL-533
Ots_hsc71-3-488
Ots_hsc71-5-453
Ots_hsp27b-150
Ots_Hsp90a
Ots_HSP90B-100
Ots_IGF-I.1-76
Ots_Ikaros-250
Ots_IL11
Ots_IL8R_C8
Ots_IsoT
Ots_LEI-292
Ots_LWSop-638
Ots_mapK-3-309
Ots_mapKpr-151
Ots_MetA
Ots_MHC1
Ots_MHC2
Ots_mybp-85
Ots_Myc-366
Ots_myo1a-384
Ots_myoD-364
Ots_NAML12-SNP1
Ots_nelfd-163
Ots_NFYB-147
Ots_nkef-192
Ots_NOD1
Ots_nramp-321
Ots_ntl-255
Ots_Ostm1
Ots_OTALDBINT1-SNP1
Ots_OTDESMIN19-SNP1
Ots_Ots311-101x
Ots_OTSMTA-SNP1
Ots_OTSTF1-SNP1
Ots_P450-288
Ots_P450
Ots_P53
Ots_parp3-286
Ots_PEMT
Ots_PGK-54
Ots_pigh-105
Ots_pop5-96
Ots_ppie-245
Ots_Prl2
Ots_RAD4543-52
Ots_RAG3
Ots_RAS1
Ots_redd1-187
Ots_RFC2-558
Ots_S7-1
Ots_SClkF2R2-135
Ots_sept9-78
Ots_SERPC1-209
Ots_SL
Ots_slc7a2-71
Ots_stk6-516
Ots_SWS1op-182
Ots_TAPBP
Ots_TCTA-58
Ots_TGFB
Ots_Thio
Ots_TLR3
Ots_TNF
Ots_Tnsf
Ots_tpx2-125
Ots_trnau1ap-86
Ots_txnip-321
Ots_u07-07.161
Ots_u07-17.135
Ots_u07-17.373
Ots_u07-18.378
Ots_u07-19.260
Ots_u07-20.332
Ots_u07-25.325
Ots_u07-49.290
Ots_u07-53.133
Ots_u07-57.120
Ots_u07-64.221
Ots_u1002-75
Ots_u1004-117
Ots_u1006-171
Ots_u1007-124
Ots_u1008-108
Ots_u202-161
Ots_u211-85
Ots_U212-158
Ots_U2305-63
Ots_U2362-227
Ots_U2362-330
Ots_U2446-123
Ots_U2567-104
Ots_u4-92
Ots_U5049-250
Ots_U5121-34
Ots_u6-75
Ots_unk1104-38
Ots_unk1832-39
Ots_unk3513-49
Ots_unk526
Ots_unk7936-50
Ots_unk9480-51
Ots_USMG5-67
Ots_vatf-251
Ots_zn593-346
Ots_zP3b-215
Ots_ZR-575
Ots_Greb1_snp1
Ots_Greb1_snp2
)
for j in {1..300}; do
for i in $( grep -n "${seqs[$j]}" /nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RogueR/CORH/sample_fastqs/PT01_A01_CRH01_OtsAC18CORH_0101.fastq | cut -d : -f 1 ); do
sed -n "$((i-1))","$((i+2))"'p ; '"$((i+3))"q /nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RogueR/CORH/sample_fastqs/PT01_A01_CRH01_OtsAC18CORH_0101.fastq >> OtsGTseqLD/"${loci[$j]}".fq
done
done
However, Sandra described this approach as time-consuming (mostly due to sorting off target reads that contained the primer, but not the probe), so it may be good to try an alternative approach before doing this.
In this approach we modify the GTseq pipeline scripts to pull out “on target reads” identified by the pipeline and then just keep a set of these for each individual.
The GTseq_SeqTest.pl script from the GTseq pipeline is able to read in a list of all unique reads (from GTseq_HashSeqs.pl), and count the occurences of the forward and probe sequences. Below, we modify the script so that it also outputs the sequences that match. After a file of matching sequences is created, we create a consensus sequence for each marker, and then use this for mapping.
#!/usr/bin/perl
#GTseq_SeqTest.pl by Nate Campbell, modified by David Dayan to collect matching reads from the input hash file
## comments with double hashes (##) throughout are Dayan's interpretation of the script and may be wrong
#Test .hash file to see how many times loci fwd primers and probe seqs occur.
use strict; use warnings;
die "usage: provide <tab delimited txt file of locus, fwd, probe1, probe2 seqs> and <*.hash>\n" unless @ARGV == 2;
##initialize variables/arrays
my @assays = ();
my @fwd_seq = ();
my @probe1 = ();
my @probe2 = ();
my @probe1RC = ();
my @probe2RC = ();
my @fwd_count = ();
my @probe_count = ();
my @both_count = ();
my @sequences = (); ## initialize new array for the reads
#read in assay and allele information and push to arrays...
open(SEQ, "<$ARGV[0]") or die "error reading $ARGV[0]\n";
## open the key file and write the assay (sequence name), fwd primer, and probe sequences to arrays, also write reverse complements of probe sequences
while (<SEQ>) {
chomp;
my @info = split(/\t/, $_);
push @assays, $info[0];
push @fwd_seq, $info[1];
push @probe1, $info[2];
push @probe2, $info[3];
my $p1RC = reverse $info[2];
my $p2RC = reverse $info[3];
$p1RC =~ tr/ACGT/TGCA/;
$p1RC =~ tr/\]\[/\[\]/;
$p2RC =~ tr/ACGT/TGCA/;
$p2RC =~ tr/\]\[/\[\]/;
push @probe1RC, $p1RC;
push @probe2RC, $p2RC;
}
close SEQ;
my $Targets = @assays; ##creates a variable for each marker
for (my $i = 0; $i < $Targets; $i++){
$fwd_count[$i] = 0;
$probe_count[$i] = 0;
$both_count[$i] = 0;
$sequences[$i] = 0 ; ## add variable to keep sequences within the $Targets variable
open(HASH, "<$ARGV[1]") or die "error reading $ARGV[1]\n";
while (<HASH>) {
my $hash = $_;
my $R1_seq = <HASH>;
chomp ($hash);
my @info = split(/;/, $hash);
my $count = $info[2];
if ($R1_seq =~ m/$fwd_seq[$i]/){
$count = $fwd_count[$i] + $count;
$fwd_count[$i] = $count;
}
$count = $info[2];
if ($R1_seq =~ m/$probe1[$i]|$probe2[$i]|$probe1RC[$i]|$probe2RC[$i]/){
$count = $probe_count[$i] + $count;
$probe_count[$i] = $count;
}
$count = $info[2];
if (($R1_seq =~ m/$fwd_seq[$i]/) && ($R1_seq =~ m/$probe1[$i]|$probe2[$i]|$probe1RC[$i]|$probe2RC[$i]/)) {
$count = $both_count[$i] + $count;
$both_count[$i] = $count;
$sequences[$i] = $R1_seq; ##add sequence
}
}
}
close HASH;
# print "$Allele1_Count[0]\n"; #testing...
for (my $j = 0; $j < $Targets; $j++){
print "$assays[$j],$fwd_count[$j],$probe_count[$j],$both_count[$j],$sequences[$j]\n"; ##added sequence to variable
}
Now run the hash script on an demuxed fastq file, then run the modified script on the hash output
SGE_Batch -c "perl ./scripts/hashseqs.pl ./demuxed_fastq/PT25_C06_ROGR_OmyAC19APPR_0220.fastq > APPR220.hash" -r log_hash
#next create the input file for the gtseq panel
# used the file at /dfs/FW_HMSC/Omalley_Lab/bohns/GTseq/Omy_GTseq390_ProbeSeqs.csv, needed to edit it to be in the right format (locus, fwd, probe1, probe2) (Locus Name,Allele1,Allele2,ProbeSeq1,ProbeSeq2,FWD_Primer,A1_correction,A2_correction)
awk 'BEGIN{FS=",";OFS="\t"} {print $1,$6,$4,$5}' /dfs/FW_HMSC/Omalley_Lab/bohns/GTseq/Omy_GTseq390_ProbeSeqs.csv > probe_seqs.csv
SGE_Batch -c "perl ./scripts/gtseq_seqtest_modified.pl probe_seqs.csv APPR220.hash > seqs.txt" -r log_seqtest
#clean up error in the output
sed -i ':a;N;/\n;/s/\n//;ta;P;D' seqs.txt
sed -i '/^$/d' seqs.txt
#stopped here
#generated a file of unique reads for each marker than could be collapsed into a consensus sequence and mapped however there are some issues with this approach (see approach 3)
The chinook results suggests that there are significant enough number of off target reads that coatin the forward primer that this is not enoguh information to map on. However, adding the probe sequence and the reverse primer may be enough to permit accurate mapping and has the advantage of not having to deal with raw reads. The challenge here will be (a) correctly orienting the primer, probe and reverse primer into a concatenated sequence to map, and (b) optimizing mapping to deal with a gapped alignment (i.e. will be missing some sequence that is not included in the primers or the probe, but do not know exactly how much)
Also of note here is that the GTseq pipeline does not use alignment match reads to primer/probe sequences, instead it simply pulls a a read at a time and checks that it contains an exact match to the forward primer AND either one of the probe sequences or its reverse complement, so making a set of rules for an aligner that is similar should be relatively straightforward.
We mapped reads from 42 Rogue River steelhead after applying the same filtering approach to identify on target reads as the GTseq v3 pipeline. Using these alignments we examined read mapping statistics to identify potentially paralogous markers, identify duplicate markers, examine marker density along the genome and generate a consensus sequence for marker amplicons.
We found that the GTseq pipeline will call genotypes from reads that may be derived from independent, paralogous genomic regions. Of the 390 markers in the panel, 30 demonstrated evidence of multimapping either due to (a) 50% or more of reads assigned as “on-target” strongly aligning to multiple genomic loci (28 markers) or (b) mapQ score of less than 3 (50% probability of incorrect alignment) (16 markers). See figure below. However, these markers do not demonstrate an excess of heterozygosity as might be expected for paralogues. However, this may be due to the use of allele correction values in the GTseq pipeline to call these GTs.
plot_grid(e,f,ncol=2)
There are two exactly identical markers in the panel. These markers vary in probe and primer sequence but tag the same SNP.
5 pairs of markers (over 10 markers) map within 1kb of one another, and 19 pairs (over 24 markers) map within 10kb. These regions of dense genotyping occur over 5 and 8 genomic regions, respectively.
For the 1kb pairs, pairs occur on 4 chromosomes, (2,4,7,28) over 5 regions (there are two regions on chromosome 7 about 47Mb apart).
For the 10kb pairs, pairs occur on 7 chromosomes (2,4,7,16,17,28,29) over 8 regions (two regions on chromosome 7 about 47Mb apart)
How are the markers distributed along the genome?
allmarkers
We can also take a closer look at chromosome 28
kp <- plotKaryotype(genome = Omy.genome, chromosomes = "28")
kpAddBaseNumbers(kp)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker)
and the Greb/Rock1 region
pp <- getDefaultPlotParams(plot.type=2)
pp$bottommargin <- 1
pp$data1height <- 400
kp <- plotKaryotype(plot.type =2, genome = Omy.genome, zoom=greb1_region, plot.params = pp)
kpPlotGenes(kp, data = db28, r0=0, r1=0.2, add.gene.names = TRUE, add.transcript.names = FALSE, add.strand.marks = FALSE, data.panel = 2, plot.transcripts = FALSE, gene.name.position = "bottom", gene.col = "#FDE725FF", clipping = FALSE)
kpAddBaseNumbers(kp, tick.dist = 100000)
kpPlotMarkers(kp, chr=amps$chr, x=amps$start, labels=amps$marker, data.panel = 1, marker.parts = c(0.1,0.1,0.1), clipping = FALSE, ymax=3)
We generated a master spreadsheet of our mapping results and marker meta information with that from the CRITFC (in repo “final_mapping_results.xlsx”). This is available in the associated git-hub repository.
There was generally good agreement between the results. Of the 352 markers that were successfully mapped both here and by the CRITFC lab, 335 aligned to the same locus on the reference genome. Mapping quaility was highly correlated between the two mapping analyses, and markers identified as potentially multi mapping were also in good agreement. Markers with disagreements in mapping locus were enriched for markers with a high number of secondary alignments.